Homework 1¶

Michał Orzyłowski¶

Dataset¶

I chose a churn dataset. It relates telephony account features and usage statistics to whether or not the customer churned.

The plots below show that all feature variables are approximately normally distributed. That is convenient because no data transformation is needed before fitting the models.

plot1.png
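Beyond eyeballing the histograms, a quick numeric sanity check of approximate normality is to look at the skewness of each column. This is a minimal sketch on synthetic stand-in columns (the column names mimic the dataset; the real check would simply call `data.skew()`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins for two of the feature columns (assumption: roughly
# normal shapes like those seen in the histograms).
df = pd.DataFrame({
    "total_day_minutes": rng.normal(180, 50, 5000),
    "total_intl_charge": rng.normal(2.8, 0.7, 5000),
})

# Skewness near zero is consistent with an approximately normal shape;
# strongly skewed columns would be candidates for a log or power transform.
print(df.skew())
```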

However, the predicted variable has a quite unbalanced distribution. For that reason, in order to evaluate the models appropriately, besides the standard accuracy metric I also compute the f1-score, recall, and ROC AUC metrics.

plot2.png
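To see why accuracy alone is misleading here, consider a trivial classifier that always predicts "no churn" on labels with roughly the same imbalance as this dataset (the 86/14 split below is a hypothetical stand-in):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels with ~14% positives, mimicking the churn imbalance.
y_true = np.array([0] * 860 + [1] * 140)
y_pred = np.zeros_like(y_true)  # always predicts the majority class

print(accuracy_score(y_true, y_pred))  # 0.86 -- looks decent
print(recall_score(y_true, y_pred))    # 0.0  -- catches no churners at all
print(f1_score(y_true, y_pred))        # 0.0
```

Recall and f1-score expose that such a model is useless, even though its accuracy matches the majority-class frequency.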

Models¶

I chose three models:

  • Logistic regression
  • Random forest
  • TabPFN

comparison.png

We can see that it was worthwhile to use the recall and f1-score metrics.

The logistic regression model achieves decent accuracy, but its recall and f1-score are terrible. The random forest performs much better and significantly outperforms the regression. However, it also overfits heavily: it reaches a nearly 100% score on the train dataset in all metrics.
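One common remedy for the poor recall of logistic regression, which I did not try here, is re-weighting the minority class. This is a minimal sketch on synthetic imbalanced data (an assumption standing in for the churn features), using scikit-learn's `class_weight='balanced'` option:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with ~86/14 class imbalance, mimicking the churn target.
X, y = make_classification(n_samples=5000, weights=[0.86], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_tr, y_tr)

# Up-weighting the minority class typically trades some accuracy for recall.
print(recall_score(y_te, plain.predict(X_te)))
print(recall_score(y_te, balanced.predict(X_te)))
```

Whether the accuracy/recall trade-off is worth it depends on how costly a missed churner is compared to a false alarm.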

But the best model overall is TabPFN. I fitted it on data limited to 1000 rows due to an issue with RAM consumption when fitting on the full dataset. Even so, it obtained better results on the test dataset than the random forest. The difference is not as big as between the first two models, but the score achieved by the random forest was already quite good.
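Since TabPFN only sees 1000 rows, how those rows are picked matters. Taking the first 1000 rows depends on the dataset's row order; a stratified random subsample would preserve the churn ratio. A sketch on stand-in data (the synthetic `x_train`/`y_train` below are assumptions with roughly the same imbalance):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for x_train / y_train with a ~14% positive rate.
rng = np.random.default_rng(42)
x_train = pd.DataFrame(rng.normal(size=(4000, 8)))
y_train = pd.Series(rng.random(4000) < 0.14).astype(int)

# Stratified subsample of 1000 rows: keeps the class ratio of the full
# train set, unlike slicing x_train[:1000].
x_sub, _, y_sub, _ = train_test_split(
    x_train, y_train, train_size=1000, stratify=y_train, random_state=42
)
print(len(x_sub), y_sub.mean())
```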

Appendix¶

In [ ]:
!pip install -q tabpfn
In [70]:
import pandas as pd
import matplotlib.pyplot as plt

Dataset exploration¶

In [71]:
data = pd.read_csv("churn.csv", index_col=0)
print(f'Dataset has {data.shape[0]} rows and {data.shape[1]} columns')
display(data.head())
Dataset has 5000 rows and 9 columns
total_day_minutes total_day_charge total_eve_minutes total_eve_charge total_night_minutes total_night_charge total_intl_minutes total_intl_charge TARGET
0 265.1 45.07 197.4 16.78 244.7 11.01 10.0 2.70 0
1 161.6 27.47 195.5 16.62 254.4 11.45 13.7 3.70 0
2 243.4 41.38 121.2 10.30 162.6 7.32 12.2 3.29 0
3 299.4 50.90 61.9 5.26 196.9 8.86 6.6 1.78 0
4 166.7 28.34 148.3 12.61 186.9 8.41 10.1 2.73 0

All feature variables are approximately normally distributed. However, the predicted variable has a quite unbalanced distribution.

In [72]:
fig, ax = plt.subplots(3, 3, figsize=(14, 10))
# Histogram of each of the 8 feature columns; the last grid cell stays empty.
for i in range(8):
    ax[i // 3, i % 3].hist(data.iloc[:, i])
    ax[i // 3, i % 3].set_title(data.columns[i])

fig.tight_layout()
In [73]:
data['TARGET'].plot(kind='hist')
Out[73]:
<AxesSubplot: ylabel='Frequency'>
In [74]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

x, y = data.iloc[:, :-1], data.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

def compute_metrics(pred_train, pred_test):
    # Train predictions may cover only a prefix of the train set (TabPFN is
    # fitted on the first 1000 rows), so align the labels to the predictions.
    train_true = y_train[:pred_train.shape[0]]
    train_acc = accuracy_score(train_true, pred_train)
    train_f1 = f1_score(train_true, pred_train)
    train_recall = recall_score(train_true, pred_train)
    train_roc_auc = roc_auc_score(train_true, pred_train)

    test_acc = accuracy_score(y_test, pred_test)
    test_f1 = f1_score(y_test, pred_test)
    test_recall = recall_score(y_test, pred_test)
    test_roc_auc = roc_auc_score(y_test, pred_test)

    display(pd.DataFrame(
        {
            'Train': [train_acc, train_f1, train_recall, train_roc_auc],
            'Test': [test_acc, test_f1, test_recall, test_roc_auc]
        },
        index=['Accuracy', 'f1-score', 'Recall', 'ROC AUC']
    ))

Logistic regression model¶

In [75]:
from sklearn.linear_model import LogisticRegression

logisticRegression = LogisticRegression().fit(x_train, y_train)
compute_metrics(logisticRegression.predict(x_train), logisticRegression.predict(x_test))
Train Test
Accuracy 0.860500 0.865000
f1-score 0.037931 0.055944
Recall 0.019366 0.028777
ROC AUC 0.509537 0.514388

Random forest model¶

In [76]:
from sklearn.ensemble import RandomForestClassifier

randomForest = RandomForestClassifier().fit(x_train, y_train)
compute_metrics(randomForest.predict(x_train), randomForest.predict(x_test))
Train Test
Accuracy 0.999750 0.889000
f1-score 0.999119 0.468900
Recall 0.998239 0.352518
ROC AUC 0.999120 0.664064

TabPFN model¶

In [77]:
from tabpfn import TabPFNClassifier

train_size = 1000  # Use only part of the data: fitting on the full dataset runs into a RAM-consumption issue.
tabPFN = TabPFNClassifier(device='cpu', N_ensemble_configurations=10).fit(x_train[:train_size], y_train[:train_size])
compute_metrics(tabPFN.predict(x_train[:train_size]), tabPFN.predict(x_test))
Loading model that can be used for inference only
Using a Transformer with 25.82 M parameters
Train Test
Accuracy 0.907000 0.895000
f1-score 0.532663 0.482759
Recall 0.395522 0.352518
ROC AUC 0.690833 0.667548